library(ggplot2)
library(dplyr)
library(GGally)
library(scales)
library(memisc)
library(reshape)
library(gridExtra)
# Load the Data
setwd("~/WorkSpace/Udemy-Datascience/IntroductionToDataScience/P4-EDA/")
df <- read.csv('wineQualityReds.csv')
In this exercise, I will explore a data set on wine quality and physicochemical properties. The objective is to explore which chemical properties influence the quality of red wines. I’ll start by exploring the data using the statistical program, R. As interesting relationships in the data are discovered, I’ll produce and refine plots to illustrate them. The data is available for download here and background information is available at this link.
Let’s run some basic functions to examine the structure and schema of the data set.
str(df)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
summary(df)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Since we’re primarily interested in quality, it would also be interesting to see basic statistics on that as well.
summary(df$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Some initial observations here:
X appears to be the unique identifier.quality is an ordered, categorical, discrete variable. From the literature, this was on a 0-10 scale, and was rated by at least 3 wine experts. The values ranged only from 3 to 8, with a mean of 5.6 and median of 6..sulfur.dioxide suffixes).fixed.acidity ~ volatile.acidity and free.sulfur.dioxide ~ total.sulfur.dioxide may possible by dependent, subsets of each other.# let's at least explore, clean up, and format the first two points.
# X
df$X = factor(df$X)
# quality
summary(df$quality)
table(df$quality)
# assertion was correct here, so let's ensure the data frame semantically
# reflects that.
df$quality <- factor(df$quality, ordered = T)
str(df$quality)
To first explore this data visually, I’ll draw up quick histograms of all 12 variables. The intention here is to see a quick distribution of the values.
# exploratory, quick histogram plots
grid.arrange(qplot(df$fixed.acidity),
qplot(df$volatile.acidity),
qplot(df$citric.acid),
qplot(df$residual.sugar),
qplot(df$chlorides),
qplot(df$free.sulfur.dioxide),
qplot(df$total.sulfur.dioxide),
qplot(df$density),
qplot(df$pH),
qplot(df$sulphates),
qplot(df$alcohol),
qplot(df$quality),
ncol = 4)
I first looked at wine quality. Although it has a discrete range of only 3-8, we can roughly see that there is some amount of normal distribution. A large majority of the wines examined received ratings of 5 or 6, and very few received 3, 4, or 8. There’s not much more we can do with this histogram, as both decreasing or increasing bin sizes would distort the data.
Given the ratings and distribution of wine quality, I’ll instantiate another categorical variable, classifying the wines as ‘bad’ (rating 0 to 4), ‘average’ (rating 5 or 6), and ‘good’ (rating 7 to 10).
df$rating <- ifelse(df$quality < 5, 'bad', ifelse(
df$quality < 7, 'average', 'good'))
df$rating <- ordered(df$rating,
levels = c('bad', 'average', 'good'))
summary(df$rating)
## bad average good
## 63 1319 217
qplot(df$rating)
ggplot(data = df,
aes(x = fixed.acidity)) +
geom_histogram() +
scale_x_log10()
ggplot(data = df,
aes(x = volatile.acidity)) +
geom_histogram() +
scale_x_log10()
ggplot(data = df,
aes(x = citric.acid)) +
geom_histogram() +
scale_x_log10()
When plotted on a base 10 logarithmic scale, fixed.acidity and volatile.acidity appear to be normally-distributed. This makes sense, considering that pH is normally distributed, and pH, by definition, is a measure of acidity and is on a logarithmic scale. Curiously, however, citric.acid, did not appear to be normally-distributed on a logarithmic scale. Upon further investigation:
length(subset(df, citric.acid == 0)$citric.acid)
## [1] 132
It is apparent that 132 observations had a value of zero. This yields some concerns on whether or not these 132 values were reported or not, considering that the next ‘bin’ higher contains only 32 observations.
p1 <- ggplot(data = df, aes(x = residual.sugar)) +
geom_histogram() +
scale_x_continuous(lim = c(0, quantile(df$residual.sugar, 0.95))) +
xlab('residual.sugar, 95th percentile truncated')
p2 <- p1 + scale_x_log10() + xlab('residual.sugar, log10')
grid.arrange(p1, p2, ncol=1)
## Warning: Removed 79 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
p1 <- ggplot(data = df, aes(x = chlorides)) +
geom_histogram() +
scale_x_continuous(lim = c(0, quantile(df$chlorides, 0.95))) +
xlab('chlorides, 95th percentile truncated')
p2 <- p1 + scale_x_log10() + xlab('chlorides, log10')
grid.arrange(p1, p2, ncol=1)
## Warning: Removed 80 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
p1 <- ggplot(data = df, aes(x = sulphates)) +
geom_histogram() +
scale_x_continuous(lim = c(0, quantile(df$sulphates, 0.95))) +
xlab('sulphates, 95th percentile truncated')
p2 <- p1 + scale_x_log10() + xlab('sulphates, log10')
grid.arrange(p1, p2, ncol=1)
## Warning: Removed 79 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
rm(p1, p2)
While exploring the univariate histogram distributions, there did not appear to be any bimodal or multimodal distributions that would warrant sub-classification into categorical variables. I considered potentially splitting residual.sugar into ‘sweet wine’ and ‘dry wine’, but the Wikipedia source cited a residual sugar of greater than 45 g/L or g/m^3 to classify as a sweet wine.
I instantiated an ordered factor, rating, classifying each wine sample as ‘bad’, ‘average’, or ‘good’.
Upon further examination of the data set documentation, it appears that fixed.acidity and volatile.acidity are different types of acids; tartaric acid and acetic acid. I decided to create a combined variable, TAC.acidity, containing the sum of tartaric, acetic, and citric acid.
df$TAC.acidity <- df$fixed.acidity + df$volatile.acidity + df$citric.acid
qplot(df$TAC.acidity)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
I addressed the distributions in the ‘Distributions’ section. Boxplots are better suited in visualizing the outliers.
get_simple_boxplot <- function(column, ylab) {
return(qplot(data = df, x = 'simple',
y = column, geom = 'boxplot',
xlab = '',
ylab = ylab))
}
grid.arrange(get_simple_boxplot(df$fixed.acidity, 'fixed acidity'),
get_simple_boxplot(df$volatile.acidity, 'volatile acidity'),
get_simple_boxplot(df$citric.acid, 'citric acid'),
get_simple_boxplot(df$TAC.acidity, 'TAC acidity'),
get_simple_boxplot(df$residual.sugar, 'residual sugar'),
get_simple_boxplot(df$chlorides, 'chlorides'),
get_simple_boxplot(df$free.sulfur.dioxide, 'free sulf. dioxide'),
get_simple_boxplot(df$total.sulfur.dioxide, 'total sulf. dioxide'),
get_simple_boxplot(df$density, 'density'),
get_simple_boxplot(df$pH, 'pH'),
get_simple_boxplot(df$sulphates, 'sulphates'),
get_simple_boxplot(df$alcohol, 'alcohol'),
ncol = 4)
In univariate analysis, I chose not to tidy or adjust any data, short of plotting a select few on logarithmic scales. Bivariate boxplots, with X as rating or quality, will be more interesting in showing trends with wine quality.
# ggpairs was WAY too slow! Uncomment and use at your own risk!
#set.seed(1)
#df_sample <- df[,-which(names(df) %in% c('X', 'rating'))][sample(1:length(df$quality), 40), ]
#ggpairs(df_sample, params = c(shape = I('.'), outlier.shape = I('.')))
To get a quick snapshot of how the variables affect quality, I generated box plots for each.
get_bivariate_boxplot <- function(x, y, ylab) {
return(qplot(data = df, x = x, y = y, geom = 'boxplot', ylab = ylab))
}
grid.arrange(get_bivariate_boxplot(df$quality, df$fixed.acidity,
'fixed acidity'),
get_bivariate_boxplot(df$quality, df$volatile.acidity,
'volatile acidity'),
get_bivariate_boxplot(df$quality, df$citric.acid,
'citric acid'),
get_bivariate_boxplot(df$quality, df$TAC.acidity,
'TAC acidity'),
get_bivariate_boxplot(df$quality, log10(df$residual.sugar),
'residual sugar'),
get_bivariate_boxplot(df$quality, log10(df$chlorides),
'chlorides'),
get_bivariate_boxplot(df$quality, df$free.sulfur.dioxide,
'free sulf. dioxide'),
get_bivariate_boxplot(df$quality, df$total.sulfur.dioxide,
'total sulf. dioxide'),
get_bivariate_boxplot(df$quality, df$density,
'density'),
get_bivariate_boxplot(df$quality, df$pH,
'pH'),
get_bivariate_boxplot(df$quality, log10(df$sulphates),
'sulphates'),
get_bivariate_boxplot(df$quality, df$alcohol,
'alcohol'),
ncol = 4)
grid.arrange(get_bivariate_boxplot(df$rating, df$fixed.acidity,
'fixed acidity'),
get_bivariate_boxplot(df$rating, df$volatile.acidity,
'volatile acidity'),
get_bivariate_boxplot(df$rating, df$citric.acid,
'citric acid'),
get_bivariate_boxplot(df$rating, df$TAC.acidity,
'TAC acidity'),
get_bivariate_boxplot(df$rating, log10(df$residual.sugar),
'residual sugar'),
get_bivariate_boxplot(df$rating, log10(df$chlorides),
'chlorides'),
get_bivariate_boxplot(df$rating, df$free.sulfur.dioxide,
'free sulf. dioxide'),
get_bivariate_boxplot(df$rating, df$total.sulfur.dioxide,
'total sulf. dioxide'),
get_bivariate_boxplot(df$rating, df$density,
'density'),
get_bivariate_boxplot(df$rating, df$pH,
'pH'),
get_bivariate_boxplot(df$rating, log10(df$sulphates),
'sulphates'),
get_bivariate_boxplot(df$rating, df$alcohol,
'alcohol'),
ncol = 4)
From exploring these plots, it seems that a ‘good’ wine generally has these trends:
Residual sugar and sulfur dioxides did not seem to have a dramatic impact on the quality or rating of the wines. Interestingly, it appears that different types of acid affect wine quality different; as such, TAC.acidity saw an attenuated trend, as the presence of volatile (acetic) acid accompanied decreased quality.
By utilizing cor.test, I calculated the correlation for each of these variables against quality:
simple_cor_test <- function(x, y) {
return(cor.test(x, as.numeric(y))$estimate)
}
correlations <- c(
simple_cor_test(df$fixed.acidity, df$quality),
simple_cor_test(df$volatile.acidity, df$quality),
simple_cor_test(df$citric.acid, df$quality),
simple_cor_test(df$TAC.acidity, df$quality),
simple_cor_test(log10(df$residual.sugar), df$quality),
simple_cor_test(log10(df$chlorides), df$quality),
simple_cor_test(df$free.sulfur.dioxide, df$quality),
simple_cor_test(df$total.sulfur.dioxide, df$quality),
simple_cor_test(df$density, df$quality),
simple_cor_test(df$pH, df$quality),
simple_cor_test(log10(df$sulphates), df$quality),
simple_cor_test(df$alcohol, df$quality))
names(correlations) <- c('fixed.acidity', 'volatile.acidity', 'citric.acid',
'TAC.acidity', 'log10.residual.sugar',
'log10.chlordies', 'free.sulfur.dioxide',
'total.sulfur.dioxide', 'density', 'pH',
'log10.sulphates', 'alcohol')
correlations
## fixed.acidity volatile.acidity citric.acid
## 0.12405165 -0.39055778 0.22637251
## TAC.acidity log10.residual.sugar log10.chlordies
## 0.10375373 0.02353331 -0.17613996
## free.sulfur.dioxide total.sulfur.dioxide density
## -0.05065606 -0.18510029 -0.17491923
## pH log10.sulphates alcohol
## -0.05773139 0.30864193 0.47616632
Quantitatively, it appears that the following variables have relatively higher correlations to wine quality:
Let’s see how these variables compare, plotted against each other and faceted by wine rating:
ggplot(data = df, aes(x = log10(sulphates), y = alcohol)) +
facet_wrap(~rating) +
geom_point()
ggplot(data = df, aes(x = volatile.acidity, y = alcohol)) +
facet_wrap(~rating) +
geom_point()
ggplot(data = df, aes(x = citric.acid, y = alcohol)) +
facet_wrap(~rating) +
geom_point()
ggplot(data = df, aes(x = volatile.acidity, y = log10(sulphates))) +
facet_wrap(~rating) +
geom_point()
ggplot(data = df, aes(x = citric.acid, y = log10(sulphates))) +
facet_wrap(~rating) +
geom_point()
ggplot(data = df, aes(x = citric.acid, y = volatile.acidity)) +
facet_wrap(~rating) +
geom_point()
The relative value of these scatterplots are suspect; if anything, it illustrates how heavily alcohol content affects rating. The weakest bivariate relationship appeared to be alcohol vs. citric acid. The plots were nearly uniformly-distributed. The strongest relationship appeared to be volatile acididty vs. citric acid, which had a negative correlation.
Examining the acidity variables, I saw strong correlations between them:
ggplot(data = df, aes(x = fixed.acidity, y = citric.acid)) +
geom_point()
cor.test(df$fixed.acidity, df$citric.acid)
##
## Pearson's product-moment correlation
##
## data: df$fixed.acidity and df$citric.acid
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6438839 0.6977493
## sample estimates:
## cor
## 0.6717034
ggplot(data = df, aes(x = volatile.acidity, y = citric.acid)) +
geom_point()
cor.test(df$volatile.acidity, df$citric.acid)
##
## Pearson's product-moment correlation
##
## data: df$volatile.acidity and df$citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
ggplot(data = df, aes(x = log10(TAC.acidity), y = pH)) +
geom_point()
cor.test(log10(df$TAC.acidity), df$pH)
##
## Pearson's product-moment correlation
##
## data: log10(df$TAC.acidity) and df$pH
## t = -39.663, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7283140 -0.6788653
## sample estimates:
## cor
## -0.7044435
Most notably, base 10 logarithm TAC.acidity correlated very well with pH. This is certainly expected, as pH is essentially a measure of acidity. An interesting question to pose, using basic chemistry knowledge, is to ask what other components other than the measured acids are affecting pH. We can quantify this difference by building a predictive linear model, to predict pH based off of TAC.acidity and capture the % difference as a new variable.
m <- lm(I(pH) ~ I(log10(TAC.acidity)), data = df)
df$pH.predictions <- predict(m, df)
# (observed - expected) / expected
df$pH.error <- (df$pH.predictions - df$pH)/df$pH
ggplot(data = df, aes(x = quality, y = pH.error)) +
geom_boxplot()
The median % error hovered at or near zero for most wine qualities. Notably, wines rated with a quality of 3 had large negative error. We can interpret this finding by saying that for many of the ‘bad’ wines, total acidity from tartaric, acetic, and citric acids were a worse predictor of pH. Simply put, it is likely that there were other components–possibly impurities–that changed and affected the pH.
As annotated previously, I hypothesized that free.sulfur.dioxide and total.sulfur.dioxide were dependent on each other. Plotting this:
ggplot(data = df, aes(x = free.sulfur.dioxide, y = total.sulfur.dioxide)) +
geom_point() +
geom_smooth()
cor.test(df$free.sulfur.dioxide, df$total.sulfur.dioxide)
##
## Pearson's product-moment correlation
##
## data: df$free.sulfur.dioxide and df$total.sulfur.dioxide
## t = 35.84, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6395786 0.6939740
## sample estimates:
## cor
## 0.6676665
It is clear that there is a very strong relationship between the two. Aside from TAC.acidity, this seemed to be the strongest bivariate relationship. Additionally, despite the telling name descriptions, the clear ‘floor’ on this graph hints that free.sulfur.dioxide is a subset of total.sulfur.dioxide.
ggplot(data = df,
aes(x = citric.acid, y = volatile.acidity,
color = quality)) +
geom_point() +
facet_wrap(~rating)
ggplot(data = df,
aes(x = alcohol, y = log10(sulphates),
color = quality)) +
geom_point() +
facet_wrap(~rating)
ggplot(data = df,
aes(x = pH, y = alcohol, color = quality)) +
geom_point() +
facet_wrap(~rating)
I primarily examined the 4 features which showed high correlation with quality. These scatterplots were a bit crowded, so I faceted by rating to illustrate the population differences between good wines, average wines, and bad wines. It’s clear that a higher citric acid and lower volatile (acetic) acid contributes towards better wines. Likewise, better wines tended to have higher sulphates and alcohol content. Surprisingly, pH had very little visual impact on wine quality, and was shadowed by the larger impact of alcohol. Interestingly, this shows that what makes a good wine depends on the type of acids that are present.
grid.arrange(ggplot(data = df, aes(x = quality, y = fixed.acidity,
fill = quality)) +
ylab('Fixed Acidity (g/dm^3)') +
xlab('Quality') +
geom_boxplot(),
ggplot(data = df, aes(x = quality, y = volatile.acidity,
fill = quality)) +
ylab('Volatile Acidity (g/dm^3)') +
xlab('Quality') +
geom_boxplot(),
ggplot(data = df, aes(x = quality, y = citric.acid,
fill = quality)) +
ylab('Citric Acid (g/dm^3)') +
xlab('Quality') +
geom_boxplot(),
ggplot(data = df, aes(x = quality, y = pH,
fill = quality)) +
ylab('pH') +
xlab('Quality') +
geom_boxplot())
These subplots were created to demonstrate the effect of acidity and pH on wine quality. Generally, higher acidity (or lower pH) is seen in highly-rated wines. To caveat this, a presence of volatile (acetic) acid negatively affected wine quality. Citric acidity had a high correlation with wine quality, while fixed (tartaric) acid had a smaller impact.
ggplot(data = df, aes(x = quality, y = alcohol,
fill = rating)) +
geom_boxplot() +
ggtitle('Alcohol Levels in Different Wine Qualities') +
xlab('Quality') +
ylab('Alcohol (% volume)')
These boxplots demonstrate the effect of alcohol content on wine quality. Generally, higher alcohol content correlated with higher wine quality. However, as the outliers and intervals show, alchol content alone did not produce a higher quality.
ggplot(data = subset(df, rating != 'average'),
aes(x = volatile.acidity, y = alcohol,
color = rating)) +
geom_point() +
ggtitle('Alcohol vs. Volatile Acidity and Wine Quality') +
xlab('Volatile Acidity (g / dm^3)') +
ylab('Alcohol (% volume)')
This is perhaps the most telling graph. I subsetted the data to remove the ‘average’ wines, or any wine with a rating of 5 or 6. As the correlation tests show, wine quality was affected most strongly by alcohol and volaticle acidity. While the boundaries are not as clear cut or modal, it’s apparent that high volatile acidity–with few exceptions–kept wine quality down. A combination of high alcohol content and low volatile acidity produced better wines.
Through this exploratory data analysis, I was able to identify the key factors that determine and drive wine quality, mainly: alcohol content, sulphates, and acidity. It is important to note, however, that wine quality is ultimately a subjective measure, albeit measured by wine experts. That said, the correlations for these variables are within reasonable bounds. The graphs adequately illustrate the factors that make good wines ‘good’ and bad wines ‘bad’. Further study with inferential statistics could be done to quantitatively confirm these assertions.